P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
This report uses the dataset from the paper cited above to explore which variables are correlated with the quality of red wine.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Our dataset has 1599 observations of 13 variables. Variable ‘X’ appears to be an index number for each observation, so we will ignore it. The ‘quality’ variable is of type int, and all other variables are numeric. We will plot a histogram of each variable to see the distribution of the data and any outliers.
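As a minimal sketch of the clean-up step described above, the index column can be dropped before plotting. The data frame name `pf` is taken from the test output later in this report; the five rows are copied from the `str()` output, and only three columns are reproduced to keep the example short. The name `wine` is a hypothetical name for the cleaned data frame.

```r
# First five rows of three columns, copied from the str() output above.
pf <- data.frame(
  X       = 1:5,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Drop the index column 'X'; it carries no information about the wine.
# 'wine' is a hypothetical name for the cleaned data frame.
wine <- pf[, names(pf) != "X"]

# Check the remaining column types: alcohol is numeric, quality is integer.
col_types <- sapply(wine, class)
print(col_types)
```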
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
From the plot we can tell that most wines have a quality of 5 or 6. Only a few have a quality of 3 or 8, which are the very low and very high ends of the scale. This is consistent with common sense.
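To put a number on "most", the counts from the table above can be turned into shares. This is only a sketch built from the printed counts, not from the original data file; the variable names are illustrative.

```r
# Counts per quality score, copied from the table() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines at each quality level.
quality_share <- quality_counts / sum(quality_counts)

# Combined share of the two most common scores, 5 and 6.
share_5_or_6 <- sum(quality_share[c("5", "6")])
print(round(share_5_or_6, 3))
```

With these counts, scores 5 and 6 together account for roughly 82% of the 1599 wines.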
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
##
## 4.6 4.7 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1
## 1 1 1 6 4 6 4 5 1 14 2 4 9 13 16
## 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.5 7.6
## 20 14 25 17 37 28 46 38 50 57 67 44 44 52 46
## 7.7 7.8 7.9 8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1
## 49 53 42 42 26 45 40 26 19 27 24 34 33 26 29
## 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10 10.1 10.2 10.3 10.4 10.5 10.6
## 16 22 17 14 17 9 15 26 23 10 19 11 21 12 14
## 10.7 10.8 10.9 11 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9 12 12.1
## 10 10 8 3 9 5 7 5 13 12 3 3 12 7 1
## 12.2 12.3 12.4 12.5 12.6 12.7 12.8 12.9 13 13.2 13.3 13.4 13.5 13.7 13.8
## 4 5 4 7 4 4 5 2 3 3 3 1 1 2 1
## 14 14.3 15 15.5 15.6 15.9
## 1 1 2 2 2 1
The distribution of fixed acidity is approximately normal, with a peak at around 7.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
##
## 0.12 0.16 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27
## 3 2 10 2 3 6 6 5 13 7 16 14
## 0.28 0.29 0.295 0.3 0.305 0.31 0.315 0.32 0.33 0.34 0.35 0.36
## 23 16 1 16 2 30 2 23 20 30 22 38
## 0.365 0.37 0.38 0.39 0.395 0.4 0.41 0.415 0.42 0.43 0.44 0.45
## 2 24 35 35 2 37 33 3 31 43 23 22
## 0.46 0.47 0.475 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.545 0.55
## 31 21 2 24 35 46 24 33 29 31 5 20
## 0.56 0.565 0.57 0.575 0.58 0.585 0.59 0.595 0.6 0.605 0.61 0.615
## 34 1 28 3 38 3 39 1 47 3 27 6
## 0.62 0.625 0.63 0.635 0.64 0.645 0.65 0.655 0.66 0.665 0.67 0.675
## 24 3 29 9 27 12 16 7 26 3 23 3
## 0.68 0.685 0.69 0.695 0.7 0.705 0.71 0.715 0.72 0.725 0.73 0.735
## 12 11 23 7 10 6 3 12 5 9 6 8
## 0.74 0.745 0.75 0.755 0.76 0.765 0.77 0.775 0.78 0.785 0.79 0.795
## 11 5 6 3 5 5 6 4 10 8 2 2
## 0.8 0.805 0.81 0.815 0.82 0.825 0.83 0.835 0.84 0.845 0.85 0.855
## 3 1 2 3 5 1 4 4 8 1 2 3
## 0.86 0.865 0.87 0.875 0.88 0.885 0.89 0.895 0.9 0.91 0.915 0.92
## 2 1 4 2 5 5 1 1 3 3 4 1
## 0.935 0.95 0.955 0.96 0.965 0.975 0.98 1 1.005 1.01 1.02 1.025
## 2 1 1 3 3 1 3 3 1 1 4 1
## 1.035 1.04 1.07 1.09 1.115 1.13 1.18 1.185 1.24 1.33 1.58
## 1 3 1 1 1 1 1 1 1 2 1
Volatile acidity ranges from 0.12 to 1.58, with a few outliers between 1.1 and 1.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
More observations have a citric acid value of 0 than any other value, and there is an outlier where citric acid is 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
##
## 0.9 1.2 1.3 1.4 1.5 1.6 1.65 1.7 1.75 1.8 1.9 2 2.05 2.1 2.15
## 2 8 5 35 30 58 2 76 2 129 117 156 2 128 2
## 2.2 2.25 2.3 2.35 2.4 2.5 2.55 2.6 2.65 2.7 2.8 2.85 2.9 2.95 3
## 131 1 109 1 86 84 1 79 1 39 49 1 24 1 25
## 3.1 3.2 3.3 3.4 3.45 3.5 3.6 3.65 3.7 3.75 3.8 3.9 4 4.1 4.2
## 7 15 11 15 1 2 8 1 4 1 8 6 11 6 5
## 4.25 4.3 4.4 4.5 4.6 4.65 4.7 4.8 5 5.1 5.15 5.2 5.4 5.5 5.6
## 1 8 4 4 6 2 1 3 1 5 1 3 1 8 6
## 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.55 6.6 6.7 7 7.2 7.3 7.5
## 1 4 3 4 4 3 2 3 2 2 2 1 1 1 1
## 7.8 7.9 8.1 8.3 8.6 8.8 8.9 9 10.7 11 12.9 13.4 13.8 13.9 15.4
## 2 3 2 3 1 2 1 1 1 2 1 1 2 1 2
## 15.5
## 1
The histogram is long-tailed. Most of the residual sugar values fall between 1.5 and 3.0. We will log-transform the residual sugar variable and plot a histogram of the transformed variable to get a better view.
In the histogram of log-transformed residual sugar, the distribution is still right-skewed.
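The effect of the log transform can be sketched on a handful of values. The ten residual sugar values below are copied from the `str()` output at the top of the report; the `skew_indicator` helper is a crude, hypothetical measure introduced only for this illustration (mean minus median, in units of standard deviation), not something used in the original analysis.

```r
# First ten residual sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Crude skew indicator (hypothetical helper): how far the mean sits
# above the median, measured in standard deviations.
skew_indicator <- function(x) (mean(x) - median(x)) / sd(x)

raw_skew <- skew_indicator(residual_sugar)
log_skew <- skew_indicator(log10(residual_sugar))

# Bin the transformed values without drawing, as a histogram would.
log_hist <- hist(log10(residual_sugar), plot = FALSE)

print(c(raw = raw_skew, log10 = log_skew))
```

On this small sample the indicator shrinks after the transform but stays positive, which matches the observation that the transformed histogram is still right-skewed.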
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
##
## 0.012 0.034 0.038 0.039 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048
## 2 1 2 4 4 3 1 5 4 4 4 8
## 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 0.059 0.06
## 8 12 1 10 5 13 8 9 10 14 17 16
## 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069 0.07 0.071 0.072
## 11 24 22 20 23 32 27 30 21 35 47 24
## 0.073 0.074 0.075 0.076 0.077 0.078 0.079 0.08 0.081 0.082 0.083 0.084
## 35 55 45 51 47 51 43 66 40 46 35 49
## 0.085 0.086 0.087 0.088 0.089 0.09 0.091 0.092 0.093 0.094 0.095 0.096
## 25 31 25 32 25 21 19 22 21 19 23 18
## 0.097 0.098 0.099 0.1 0.101 0.102 0.103 0.104 0.105 0.106 0.107 0.108
## 18 12 8 13 5 10 7 16 6 8 9 1
## 0.109 0.11 0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118 0.119 0.12
## 3 8 7 6 1 11 5 2 4 8 3 3
## 0.121 0.122 0.123 0.124 0.125 0.126 0.127 0.128 0.132 0.136 0.137 0.143
## 2 7 6 3 1 1 1 1 4 1 1 1
## 0.145 0.146 0.147 0.148 0.152 0.153 0.157 0.159 0.161 0.165 0.166 0.168
## 1 1 1 1 2 1 3 1 1 1 3 1
## 0.169 0.17 0.171 0.172 0.174 0.176 0.178 0.186 0.19 0.194 0.2 0.205
## 1 1 2 1 1 1 2 1 1 1 1 2
## 0.213 0.214 0.216 0.222 0.226 0.23 0.235 0.236 0.241 0.243 0.25 0.263
## 1 3 1 1 2 1 1 1 1 1 1 1
## 0.267 0.27 0.332 0.337 0.341 0.343 0.358 0.36 0.368 0.369 0.387 0.401
## 1 1 1 1 1 1 1 1 1 1 1 1
## 0.403 0.413 0.414 0.415 0.422 0.464 0.467 0.61 0.611
## 1 1 2 3 1 1 1 1 1
The histogram is right-skewed with a long tail. Most of the chloride values are between 0.04 and 0.12.
The histogram of log-transformed chlorides is almost normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
##
## 1 2 3 4 5 5.5 6 7 8 9 10 11 12 13 14
## 3 1 49 41 104 1 138 71 56 62 79 59 75 57 50
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 78 61 60 46 39 30 41 22 32 34 24 32 29 23 23
## 30 31 32 33 34 35 36 37 37.5 38 39 40 40.5 41 42
## 16 20 22 11 18 15 11 3 2 9 5 6 1 7 3
## 43 45 46 47 48 50 51 52 53 54 55 57 66 68 72
## 3 3 1 1 4 2 4 3 1 1 2 1 1 2 1
The histogram peaks at around 5 to 6 and then declines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
##
## 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 3 4 14 14 27 26 29 28 33 35 26 27 35 29 33
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 25 25 34 36 27 24 30 43 20 14 32 20 17 20 26
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 12 26 31 16 17 14 26 18 23 20 17 24 21 21 11
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
## 11 15 14 20 13 10 6 14 9 18 9 9 13 10 17
## 66 67 68 69 70 71 72 73 74 75 76 77 77.5 78 79
## 9 12 10 8 8 7 10 7 8 5 3 8 2 4 5
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
## 4 6 4 2 6 9 10 6 14 9 5 7 8 2 8
## 95 96 98 99 100 101 102 103 104 105 106 108 109 110 111
## 4 5 7 6 3 4 6 2 5 5 6 3 4 6 3
## 112 113 114 115 116 119 120 121 122 124 125 126 127 128 129
## 3 4 2 2 1 7 2 4 3 3 2 1 2 2 3
## 130 131 133 134 135 136 139 140 141 142 143 144 145 147 148
## 1 3 3 2 2 2 1 1 3 1 2 3 3 3 2
## 149 151 152 153 155 160 165 278 289
## 1 2 1 1 1 1 1 1 1
The histogram is right-skewed, and there are two outliers at 278 and 289.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
##
## 0.99007 0.9902 0.99064 0.9908 0.99084 0.9912 0.9915 0.99154 0.99157
## 2 1 2 1 1 1 1 1 1
## 0.9916 0.99162 0.9917 0.99182 0.99191 0.9921 0.9922 0.99235 0.99236
## 2 1 1 2 1 1 2 1 1
## 0.9924 0.99242 0.99252 0.99256 0.99258 0.99264 0.9927 0.9928 0.99286
## 3 2 1 1 3 1 1 2 1
## 0.9929 0.99292 0.99294 0.99306 0.99314 0.99316 0.99318 0.9932 0.99322
## 1 1 2 1 1 2 1 1 1
## 0.99323 0.99328 0.9933 0.99331 0.99332 0.99334 0.99336 0.9934 0.99341
## 1 1 1 2 1 1 1 4 1
## 0.99344 0.99346 0.99348 0.9935 0.99352 0.99354 0.99356 0.99357 0.99358
## 1 3 1 1 2 2 4 1 3
## 0.9936 0.99362 0.99364 0.9937 0.99371 0.99374 0.99376 0.99378 0.99379
## 2 2 1 2 2 2 3 3 1
## 0.9938 0.99384 0.99385 0.99386 0.99387 0.99388 0.99392 0.99394 0.99395
## 1 1 1 1 1 2 2 1 1
## 0.99396 0.99397 0.994 0.99402 0.99408 0.9941 0.99414 0.99416 0.99417
## 3 1 2 4 3 1 2 1 1
## 0.99418 0.99419 0.9942 0.99425 0.99426 0.99428 0.9943 0.99434 0.99437
## 2 2 3 1 1 1 2 1 1
## 0.99438 0.99439 0.9944 0.99444 0.99448 0.99451 0.99454 0.99456 0.99458
## 5 1 3 4 4 1 1 1 4
## 0.99459 0.9946 0.99462 0.99464 0.99467 0.99468 0.9947 0.99471 0.99472
## 1 5 2 2 2 1 6 3 3
## 0.99473 0.99474 0.99476 0.99478 0.99479 0.9948 0.99483 0.99484 0.99486
## 1 1 3 2 1 9 1 3 1
## 0.99488 0.99489 0.9949 0.99491 0.99492 0.99494 0.99495 0.99496 0.99498
## 4 3 4 1 2 4 2 1 5
## 0.99499 0.995 0.99501 0.99502 0.99504 0.99506 0.99508 0.99509 0.9951
## 1 10 1 2 2 1 3 1 4
## 0.99512 0.99514 0.99516 0.99517 0.99518 0.99519 0.9952 0.99521 0.99522
## 2 5 6 1 3 1 9 1 4
## 0.99523 0.99524 0.99525 0.99526 0.99528 0.99529 0.9953 0.99531 0.99532
## 1 4 2 2 3 1 4 2 1
## 0.99533 0.99534 0.99536 0.99538 0.9954 0.99541 0.99542 0.99543 0.99544
## 1 6 2 11 4 1 1 2 1
## 0.99545 0.99546 0.99547 0.99549 0.9955 0.99551 0.99552 0.99553 0.99554
## 3 7 2 2 14 3 5 1 3
## 0.99555 0.99556 0.99557 0.99558 0.9956 0.99562 0.99564 0.99565 0.99566
## 1 2 3 3 14 4 2 3 4
## 0.99568 0.99569 0.9957 0.99572 0.99573 0.99574 0.99575 0.99576 0.99577
## 4 1 6 9 1 2 2 5 3
## 0.99578 0.9958 0.99581 0.99582 0.99584 0.99585 0.99586 0.99587 0.99588
## 3 14 1 1 2 3 6 2 4
## 0.99589 0.9959 0.99592 0.99593 0.99594 0.99596 0.99598 0.99599 0.996
## 1 13 4 2 1 2 2 2 13
## 0.99603 0.99604 0.99605 0.99606 0.99608 0.99609 0.9961 0.99612 0.99613
## 2 3 3 2 2 1 10 6 4
## 0.99614 0.99615 0.99616 0.99617 0.99619 0.9962 0.99621 0.99622 0.99623
## 2 5 7 1 1 28 1 5 2
## 0.99624 0.99625 0.99627 0.99628 0.99629 0.9963 0.99631 0.99632 0.99633
## 3 3 3 3 2 15 1 4 4
## 0.99634 0.99635 0.99636 0.99638 0.99639 0.9964 0.99641 0.99642 0.99643
## 3 1 5 5 2 25 1 3 1
## 0.99645 0.99646 0.99647 0.99648 0.99649 0.9965 0.99651 0.99652 0.99654
## 1 1 2 3 1 11 1 6 2
## 0.99655 0.99656 0.99658 0.99659 0.9966 0.99661 0.99664 0.99665 0.99666
## 6 5 1 2 23 1 3 1 3
## 0.99667 0.99668 0.99669 0.9967 0.99672 0.99674 0.99675 0.99676 0.99677
## 1 4 2 13 5 2 5 3 2
## 0.99678 0.9968 0.99682 0.99683 0.99684 0.99685 0.99686 0.99688 0.99689
## 1 35 2 2 1 8 3 2 4
## 0.9969 0.99692 0.99693 0.99694 0.99695 0.99697 0.99698 0.99699 0.997
## 18 4 2 3 1 1 1 1 24
## 0.99701 0.99702 0.99704 0.99705 0.99706 0.99708 0.99709 0.9971 0.99712
## 2 4 3 1 2 4 1 13 4
## 0.99713 0.99714 0.99716 0.99717 0.99718 0.99719 0.9972 0.99721 0.99722
## 2 2 2 1 3 1 36 1 1
## 0.99724 0.99725 0.99726 0.99727 0.99728 0.99729 0.9973 0.99732 0.99733
## 4 1 1 1 3 1 18 3 1
## 0.99734 0.99735 0.99736 0.99738 0.99739 0.9974 0.99743 0.99744 0.99745
## 4 6 5 4 1 22 2 2 9
## 0.99746 0.99747 0.99748 0.9975 0.99752 0.99754 0.99756 0.99758 0.9976
## 7 2 3 7 1 1 1 1 35
## 0.99761 0.99764 0.99765 0.99768 0.99769 0.9977 0.99772 0.99774 0.99779
## 1 1 1 3 2 4 1 5 1
## 0.9978 0.99782 0.99783 0.99784 0.99785 0.99786 0.99787 0.99788 0.9979
## 26 2 2 1 1 4 3 2 14
## 0.99791 0.99796 0.99798 0.998 0.99801 0.99803 0.99808 0.9981 0.99814
## 1 1 2 29 2 3 1 10 2
## 0.99815 0.99817 0.99818 0.9982 0.99822 0.99823 0.99824 0.99828 0.9983
## 2 2 3 23 1 1 3 2 9
## 0.99832 0.99834 0.99836 0.9984 0.99842 0.99845 0.9985 0.99852 0.99854
## 1 1 2 20 2 1 3 1 1
## 0.99855 0.99859 0.9986 0.99864 0.99865 0.9987 0.99878 0.9988 0.99888
## 2 1 19 1 2 12 1 20 2
## 0.9989 0.99892 0.999 0.99901 0.9991 0.99914 0.99915 0.99918 0.9992
## 2 3 8 1 10 3 1 1 7
## 0.99922 0.99925 0.9993 0.99935 0.99938 0.99939 0.9994 0.9995 0.9996
## 1 1 4 1 1 1 24 1 12
## 0.99965 0.9997 0.99974 0.99975 0.99976 0.9998 0.9999 1 1.00005
## 1 8 1 1 1 10 1 10 2
## 1.0001 1.00012 1.00015 1.0002 1.00024 1.00025 1.0003 1.0004 1.0006
## 4 1 2 10 1 1 2 9 6
## 1.0008 1.001 1.0014 1.0015 1.0018 1.0021 1.0022 1.00242 1.0026
## 3 6 6 2 1 2 2 2 2
## 1.00289 1.00315 1.0032 1.00369
## 1 3 1 2
Density is approximately normally distributed, with a peak at around 0.997.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
##
## 2.74 2.86 2.87 2.88 2.89 2.9 2.92 2.93 2.94 2.95 2.98 2.99 3 3.01 3.02
## 1 1 1 2 4 1 4 3 4 1 5 2 6 5 8
## 3.03 3.04 3.05 3.06 3.07 3.08 3.09 3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17
## 6 10 8 10 11 11 11 19 9 20 13 21 34 36 27
## 3.18 3.19 3.2 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 3.3 3.31 3.32
## 30 25 39 36 39 32 29 26 53 35 42 46 57 39 45
## 3.33 3.34 3.35 3.36 3.37 3.38 3.39 3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47
## 37 43 39 56 37 48 48 37 34 33 17 29 20 22 21
## 3.48 3.49 3.5 3.51 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59 3.6 3.61 3.62
## 19 10 14 15 18 17 16 8 11 10 10 8 7 8 4
## 3.63 3.66 3.67 3.68 3.69 3.7 3.71 3.72 3.74 3.75 3.78 3.85 3.9 4.01
## 3 4 3 5 4 1 4 3 1 1 2 1 2 2
pH is approximately normally distributed, with values between 2.74 and 4.01.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
##
## 0.33 0.37 0.39 0.4 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52
## 1 2 6 4 5 8 16 12 18 19 29 31 27 26 47
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67
## 51 68 50 60 55 68 51 69 45 61 48 46 41 42 36
## 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82
## 35 23 33 26 28 26 26 20 25 26 23 18 19 15 22
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97
## 15 13 14 13 13 7 7 8 8 5 10 4 2 3 6
## 0.98 0.99 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.1 1.11 1.12
## 2 3 1 1 3 2 2 3 4 2 3 1 2 1 1
## 1.13 1.14 1.15 1.16 1.17 1.18 1.2 1.22 1.26 1.28 1.31 1.33 1.34 1.36 1.56
## 2 2 1 1 5 3 1 1 1 2 1 1 1 3 1
## 1.59 1.61 1.62 1.95 1.98 2
## 1 1 1 2 1 1
The histogram is right-skewed and long-tailed, with some outliers at around 1.95 to 2.
In the log-transformed histogram, we can still see the rises and drops in the range 0.5 to 0.8.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
##
## 8.4 8.5 8.7 8.8
## 2 1 2 2
## 9 9.05 9.1 9.2
## 30 1 23 72
## 9.23333333333333 9.25 9.3 9.4
## 1 1 59 103
## 9.5 9.55 9.56666666666667 9.6
## 139 2 1 59
## 9.7 9.8 9.9 9.95
## 54 78 49 1
## 10 10.0333333333333 10.1 10.2
## 67 2 47 46
## 10.3 10.4 10.5 10.55
## 33 41 67 2
## 10.6 10.7 10.75 10.8
## 28 27 1 42
## 10.9 11 11.0666666666667 11.1
## 49 59 1 27
## 11.2 11.3 11.4 11.5
## 36 32 32 30
## 11.6 11.7 11.8 11.9
## 15 23 29 20
## 11.95 12 12.1 12.2
## 1 21 13 12
## 12.3 12.4 12.5 12.6
## 12 13 21 6
## 12.7 12.8 12.9 13
## 9 17 9 6
## 13.1 13.2 13.3 13.4
## 2 1 3 3
## 13.5 13.5666666666667 13.6 14
## 1 1 4 7
## 14.9
## 1
The histogram peaks at around 9.4 to 9.5.
There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality). The quality feature is of type int; all other features are numeric.
The main features of interest are volatile acidity, citric acid and free sulfur dioxide.
Total sulfur dioxide may also help support the investigation.
No new variables were created from existing variables at this stage.
Some features have a long tail, such as residual sugar, chlorides and sulphates.
I log-transformed those features when plotting them to get a better view of the distribution of the data.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Fixed acidity does not seem to have a relationship with quality.
##
## Pearson's product-moment correlation
##
## data: pf$fixed.acidity and pf$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
The correlation coefficient is only 0.12. Although it is statistically significant (p < 0.001), the correlation is very weak, which is consistent with the observation above that fixed acidity has little relationship with quality.
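The test output above pairs a small correlation with a very small p-value. A short worked example, using only the numbers printed in that output, shows how the t statistic follows from the correlation and the sample size, which is why a weak correlation can still be statistically significant with 1599 observations. The variable names are illustrative.

```r
# Values copied from the cor.test() output above.
r  <- 0.1240516   # sample correlation between fixed acidity and quality
df <- 1597        # degrees of freedom, n - 2 with n = 1599

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, p = p_value))
```

The same formula applied to the other coefficients in this report reproduces their printed t values, so the p-values mostly reflect the large sample rather than strong relationships.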
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity seems to have a negative correlation with quality; we will compute the correlation coefficient below to check.
##
## Pearson's product-moment correlation
##
## data: pf$volatile.acidity and pf$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
The correlation coefficient is -0.39, so there is a weak negative correlation between volatile acidity and quality.
From the scatterplot, citric acid and quality do not look correlated.
##
## Pearson's product-moment correlation
##
## data: pf$citric.acid and pf$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The correlation coefficient is 0.23, a weak positive correlation that is hard to see in the scatterplot.
No obvious relationship between residual sugar and quality is observed.
##
## Pearson's product-moment correlation
##
## data: pf$residual.sugar and pf$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
The correlation coefficient is only 0.01 and is not statistically significant (p = 0.58), so there is nearly no linear relationship between residual sugar and wine quality.
For wines of higher quality (5 to 8), the quality score drops as chlorides rise, but this does not apply to wines of quality 3 and 4.
##
## Pearson's product-moment correlation
##
## data: pf$chlorides and pf$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
The correlation coefficient is -0.13; that is to say, there is hardly any correlation between chlorides and quality.
Again, there is no apparent correlation between free sulfur dioxide and quality.
##
## Pearson's product-moment correlation
##
## data: pf$free.sulfur.dioxide and pf$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
##
## Pearson's product-moment correlation
##
## data: pf$total.sulfur.dioxide and pf$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
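Total sulfur dioxide has only a weak negative correlation with quality (-0.19), so again there is no strong relationship.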
##
## Pearson's product-moment correlation
##
## data: pf$density and pf$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
Density also has only a weak negative correlation with quality (-0.17).
##
## Pearson's product-moment correlation
##
## data: pf$pH and pf$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
pH has almost no correlation with quality (-0.06).
##
## Pearson's product-moment correlation
##
## data: pf$sulphates and pf$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Sulphates has a weak positive correlation with quality (0.25), but again the correlation is not strong.
Except for a few outliers, the quality score goes higher as alcohol gets higher, so wine quality should be correlated with alcohol. We will remove those outliers to get a better plot below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Though there are many more points between 9 and 12, we can still see the trend that as alcohol gets higher, quality gets higher too. We will check the correlation coefficient below.
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The correlation coefficient is 0.48, so alcohol and quality are moderately correlated.
Based on the scatterplot matrix, alcohol and volatile acidity are the two factors most correlated with the quality of red wine, with correlation coefficients of 0.48 and -0.39. This is consistent with our observations above.
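The ranking can be checked directly from the coefficients printed in the test outputs above. The sketch below simply collects those eleven values and sorts them by absolute size; the variable names are illustrative.

```r
# Correlations with quality, copied from the cor.test() outputs above.
cor_with_quality <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  density              = -0.1749192,
  pH                   = -0.05773139,
  sulphates            =  0.2513971,
  alcohol              =  0.4761663
)

# Rank by strength of the linear relationship, ignoring the sign.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(ranked, 2))
```

Alcohol and volatile acidity come out on top, followed by sulphates and citric acid.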
Also, considering that the values of quality are all integers, we can create a new factor variable for quality and draw some box plots to see whether anything interesting turns up.
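A minimal sketch of that step, using the first ten rows from the `str()` output at the top of the report. The column name `quality_fac` matches the one that appears in later output; making the factor ordered is an assumption, since the original code is not shown.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
pf <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# New factor version of quality (ordered = TRUE is an assumption).
pf$quality_fac <- factor(pf$quality, ordered = TRUE)

# Median alcohol per quality level: the centre line of each box in a box plot.
median_alcohol <- tapply(pf$alcohol, pf$quality_fac, median)
print(median_alcohol)

# The box plot itself could then be drawn with:
# boxplot(alcohol ~ quality_fac, data = pf)
```

With only ten rows the medians mean little; the point is the shape of the code, which works the same way on the full data frame.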
From the boxplot we can see that although the low-quality wines (quality levels 3 and 4) have an average alcohol of about 10, higher than that of quality level 5, the good-quality wines (levels 7 and 8) have an average alcohol of more than 10.5. The average alcohol of quality level 8 is even higher, at about 11.5.
We can see that as the quality level goes higher, volatile acidity drops.
My features of interest were initially volatile acidity, citric acid and free sulfur dioxide, because the description in the txt file suggested so. But after plotting the scatterplots and checking the correlation coefficient between each feature and wine quality, only volatile acidity among those three shows even a weak correlation with wine quality.
Yes, alcohol is more strongly correlated with wine quality than any other feature.
Alcohol seems to be the factor most strongly correlated with the quality of red wine.
As observed above, alcohol and volatile acidity are the two factors most correlated with the quality of red wine, with correlation coefficients of 0.48 and -0.39. The next two variables most correlated with quality are sulphates and citric acid, with correlation coefficients of 0.25 and 0.23. We will add those variables to the scatterplots of alcohol vs. quality and volatile acidity vs. quality to see whether we can find something interesting. First we need to convert the numeric variables sulphates and citric acid to factors with the cut function.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality quality_fac sulphates_bucket
## 1 5 5 Sulph_Middle
## 2 5 5 Sulph_High
## 3 5 5 Sulph_High
## 4 6 6 Sulph_Middle
## 5 5 5 Sulph_Middle
## 6 5 5 Sulph_Middle
##
## Sulph_Low Sulph_Middle Sulph_High Sulph_VeryHigh
## 420 409 384 386
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality quality_fac sulphates_bucket citric.acid_bucket
## 1 5 5 Sulph_Middle Citric_Low
## 2 5 5 Sulph_High Citric_Low
## 3 5 5 Sulph_High Citric_Low
## 4 6 6 Sulph_Middle Citric_VeryHigh
## 5 5 5 Sulph_Middle Citric_Low
## 6 5 5 Sulph_Middle Citric_Low
##
## Citric_Low Citric_Middle Citric_High Citric_VeryHigh
## 403 449 349 398
Then we will draw the scatterplots of alcohol vs. quality and volatile acidity vs. quality, colored by sulphates_bucket and citric.acid_bucket.
Though it is not that obvious, the overall trend of the color change is roughly from Low to High as the quality gets higher.
The scatterplot of quality vs. alcohol colored by citric.acid_bucket looks similar to the one colored by sulphates_bucket. Though it is not that obvious, the color does change as the quality changes.
As the quality goes down, there is a color change.
Same as above.
In the scatterplot of quality vs. alcohol colored by sulphates_bucket and faceted by citric.acid_bucket, we can roughly see in the ‘Citric_High’ and ‘Citric_VeryHigh’ panels that when alcohol and sulphates go higher, the quality score gets higher.
In the scatterplot of quality vs. alcohol colored by citric.acid_bucket and faceted by sulphates_bucket, there are some exceptions, but the trend is that when sulphates and citric acid are higher, quality gets better as alcohol goes higher.
We can see that there are more good-quality wines when sulphates are high, citric acid is high and volatile acidity is low.
Just about the same as above.
Good-quality wine tends to have higher alcohol, lower volatile acidity, higher sulphates and higher citric acid.
Not really. It is more or less the same as in the Bivariate Analysis.
No. As we know, alcohol and volatile acidity are not that strongly correlated with quality, so a linear model would not help much here.
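One way to quantify this: for a linear model with a single predictor, the R-squared equals the squared correlation coefficient. The sketch below applies that identity to the two coefficients reported earlier; no model is actually fitted here, and the variable names are illustrative.

```r
# Correlations with quality, copied from the cor.test() outputs above.
r_alcohol  <-  0.4761663
r_volatile <- -0.3905578

# For a one-predictor linear model, R-squared is the squared correlation.
r2_alcohol  <- r_alcohol^2
r2_volatile <- r_volatile^2

print(round(c(alcohol = r2_alcohol, volatile.acidity = r2_volatile), 3))
```

Taken alone, alcohol would explain about 23% of the variance in quality and volatile acidity about 15%, which supports the decision not to build a linear model at this stage.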
The histogram shows the distribution of the wine quality.
From the plot we can tell that most wines have a quality of 5 or 6.
Only a few have a quality of 3 or 8, which are the very low and very high ends of the scale.
We can see from the scatterplot that as alcohol goes higher, the quality of wine also goes higher. There is a positive correlation between alcohol and wine quality, although the correlation is not that strong.
As shown in the scatterplot above, as alcohol, sulphates and citric acid go higher, the quality score of wine tends to go higher too.
First of all, the working directory was really annoying. The setwd function only applies to the chunk it is called in, so you have to write code at the very beginning of the Rmd to set the working directory. But after I did that and set the working directory to the folder where I keep all my datasets, what was frustrating was that whenever you create a new file, it shows up there too, which is also annoying. So I gave up and made a copy of the dataset in the same folder as the Rmd file. That left me in peace for a while, until I found something weird: when you create a new file, sometimes it shows up in folder A and sometimes in folder B, which is really frustrating. I mean, how much interest would you have left to move on to the EDA adventure if you kept being bothered by annoying things like this?
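For reference, a sketch of two common ways around this, assuming the report is knitted with knitr. The folder name `data` and the file name `wineQualityReds.csv` are hypothetical placeholders, not taken from this report.

```r
# Hypothetical data folder; replace with the folder that holds the CSV file.
data_dir <- file.path(getwd(), "data")

# In the first (setup) chunk of the Rmd: set the root directory used when
# evaluating all later chunks. Guarded so the snippet also runs where knitr
# is not installed or the folder does not exist.
if (requireNamespace("knitr", quietly = TRUE) && dir.exists(data_dir)) {
  knitr::opts_knit$set(root.dir = data_dir)
}

# Alternative that avoids changing directories at all: build full paths.
# The file name below is an assumption for illustration.
csv_path <- file.path(data_dir, "wineQualityReds.csv")
print(csv_path)
```

Building full paths with `file.path` leaves the working directory alone, so newly created files keep landing next to the Rmd file.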
Secondly, I would say I was trapped by some misunderstandings of EDA. I mean, I know exactly what EDA means, but whenever I created a plot, there was a voice that kept telling me that there should be something more, something valuable there, and that I should look more deeply to figure it out. I am afraid this kind of exhausted me, and the interest and passion for playing with data faded little by little.
Though I thought I had learned a lot through the videos and quizzes, when it was my turn to do my own analysis, basic functions aside, it was a little difficult to find the right function to use. Even after reviewing the videos again, I still had some of these issues. Hopefully it will get better once I have done more exercises or projects.
And when plotting the scatterplots of the features against quality, when almost all the features had no strong correlation with quality, I felt like I did not know what I was doing or what I was going to do, and I just had to dig into the volatile acidity and alcohol features. But I always had the concern that those would not work, that those would not be convincing enough to predict wine quality.
When using the cut function to convert numeric variables to factors, the minimum value on the left is not included in the first interval, so for the first break we need to use a value less than the minimum, or there will be NAs.
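The behaviour described above can be reproduced on a few values. The breaks are the sulphates quartiles from the `summary()` output and the labels are the bucket names from the table output, so using quartile breaks is an inference from those outputs rather than the confirmed original code. The lowered first break of 0.32 is an illustrative choice, and `include.lowest = TRUE` is shown as an alternative that the report does not use.

```r
# Quartiles of sulphates, copied from the summary() output earlier in the report.
breaks <- c(0.33, 0.55, 0.62, 0.73, 2.00)
labels <- c("Sulph_Low", "Sulph_Middle", "Sulph_High", "Sulph_VeryHigh")

# The minimum value plus the first three sulphates values from the head() output.
sulphates <- c(0.33, 0.56, 0.68, 0.65)

# Default intervals are (a, b]: the minimum falls outside and becomes NA.
naive <- cut(sulphates, breaks = breaks, labels = labels)

# Fix 1 (the one described above): start the first interval below the minimum.
lowered <- cut(sulphates, breaks = c(0.32, breaks[-1]), labels = labels)

# Fix 2 (an alternative): close the first interval on the left.
closed <- cut(sulphates, breaks = breaks, labels = labels, include.lowest = TRUE)

print(data.frame(sulphates, naive, lowered, closed))
```

Both fixes assign the minimum to the lowest bucket; `include.lowest = TRUE` avoids having to pick an arbitrary value below the minimum.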
TO DO: A model was not created during this analysis, since the EDA so far did not show a strong relationship between quality and the other variables, so a linear regression model would not predict well. In the future we could think about creating another kind of model (logistic regression, etc.) to predict red wine quality. Also, the dataset we are using has only 1599 observations, which is rather small; for future analysis, it would be great if a larger dataset were available.